AMST2: aggregated multi-level spatial and temporal context-based transformer for robust aerial tracking

Many recent visual trackers have made significant progress by incorporating either spatial information from multi-level convolution layers or temporal information for tracking. However, the complementary advantages of spatial and temporal information cannot be leveraged when the two types of information are used separately. In this paper, we present a new approach for robust visual tracking using a transformer-based model that incorporates both spatial and temporal context information at multiple levels. To integrate the similarity maps refined by the multi-level spatial and temporal encoders, we propose an aggregation encoder. Consequently, the output of the proposed aggregation encoder contains useful features that integrate the global contexts of the multi-level spatial and temporal information. The proposed feature offers a contrasting yet complementary representation of multi-level spatial and temporal contexts. This characteristic is particularly beneficial in complex aerial scenarios, where tracking failures can occur due to occlusion, motion blur, small objects, and scale variations. In addition, our tracker utilizes a light-weight network backbone, ensuring fast and effective object tracking in aerial datasets. Furthermore, the proposed architecture achieves more robust object tracking against significant variations by updating the features of the latest object while retaining the initial template information. Extensive experiments on seven challenging short-term and long-term aerial tracking benchmarks demonstrate that the proposed tracker outperforms state-of-the-art tracking methods in terms of both real-time processing speed and performance.

www.nature.com/scientificreports/
increasingly popular due to its ability to incorporate both spatial and temporal context information in a flexible and efficient manner, enabling better tracking performance in various scenarios. Most transformer-based trackers feed the transformer with features extracted from the backbone network [47-50, 54, 55]. Inspired by the main idea of the transformer, TransT proposed a feature fusion network composed of an ego-context augmentation module with self-attention and a cross-feature augment module with cross-attention [47]. The final tracking result is obtained from the output features of the feature fusion network through classification and box regression. TrDiMP utilizes the DiMP model predictor and generates model weights by using the output features of the transformer encoder as training samples [48]. The target model then calculates the target score map by applying the predicted weights to the output features generated by the transformer decoder. TrDiMP incorporates a probabilistic IoUNet for bounding box regression and also introduces TrSiam, which formulates the proposed model as a Siamese-like pipeline. STARK, proposed in [49], is a tracker using an end-to-end transformer architecture based on DETR [58]. The model learns robust spatio-temporal representations by leveraging the global relationships in both spatial and temporal information through the encoder, which extracts discriminative spatio-temporal features that are fed into the decoder. Furthermore, this tracker eliminates the need for post-processing techniques such as the cosine window or bounding box smoothing, thereby simplifying the existing tracking pipeline. ToMP predicts the weights of the convolutional kernel for object localization using a transformer-based model prediction module to overcome the limitations of existing optimization-based target localization [50].
The transformer-based target model predictor can avoid unnecessary repetitive optimization and dynamically generate discriminative features using target information. AiATrack introduced an attention in attention (AiA) module that enhances appropriate correlations and suppresses ambiguous correlations in order to suppress the noise of the existing attention mechanism. By introducing a model update method that directly reuses previously encoded cached features, they propose a simplified tracking process that effectively utilizes short-term and long-term references, showing remarkable performance.
In addition, transformer-based tracking methods that adopt a lightweight backbone for aerial tracking have been actively studied [54, 55]. Unlike the trackers mentioned above, trackers in which the CNN backbone is replaced with a transformer also show remarkable performance [60, 61].

Figure 1. Qualitative comparison with state-of-the-art trackers. This figure shows the results of the proposed tracker AMST2 and three state-of-the-art trackers on some challenging video sequences (Animal2 and Vaulting from DTB70, and Bike2 and Truck1 from UAV123). The AMST2 tracker demonstrates superior performance over the other algorithms by combining multi-level spatial and temporal context while adding a feature-level template update mechanism.
Multi-level spatial and temporal information-based visual tracking. Incorporating both spatial and temporal information is crucial for enhancing performance in the field of object tracking. Many trackers use multi-level spatial features to extract the relationship between the template and the current search region along the spatial dimension [12, 26, 29, 30, 54]. Trackers that use multi-scale features have the advantage of robustly localizing objects of various scales. Dynamic template-based trackers, such as Updatenet [45] and SiamTOL [44], have been developed to enhance tracking performance by utilizing temporal information. In particular, TCTrack introduced a tracking method that considers temporal contexts at two levels: the search feature level and the similarity map level [55]. Trackers that take temporal information into account can achieve robust performance by capturing changes in the state of the object across frames. However, when multi-level spatial and temporal information are used separately, the complementary advantages of the two types of information cannot be exploited. To address this limitation, a method has been introduced that improves the robustness of the tracker by integrating spatial and temporal information through simultaneous learning with the transformer, as demonstrated in the STARK tracker [49].
Aerial visual tracking. Due to technological advancements in UAVs equipped with visual tracking capabilities, aerial tracking has been widely applied in sectors such as aviation, agriculture, transportation, and defense [1-3]. One significant challenge in aerial tracking arises from image distortion caused by UAV flight vibrations and complex environments. Specifically, when a UAV flying at a high altitude captures an object on the ground, it is difficult to extract rich features due to the small size of the object. While deep learning-based trackers have demonstrated superiority on various UAV datasets, the limited resources of aerial platforms hinder the use of heavy models and limit tracking performance improvement. To address these challenges, several specialized trackers have been developed using different UAV datasets. AutoTrack is a DCF-based tracker that automatically tunes the hyperparameters of the space-time regularization, demonstrating high performance on a CPU [62]. COMET improves tracking accuracy by proposing a context-aware IoU-guided tracker that utilizes a multi-task two-stream network for small object tracking and an offline reference proposal generation strategy [63]. Additionally, adopting an anchor proposal network to generate high-quality anchors for light-weight Siamese network-based trackers has shown excellent aerial tracking performance [52, 53]. Moreover, applying a transformer to the light-weight Siamese network backbone has resulted in notable progress by enhancing the correlation map [54, 55].
The development of miniaturized embedded AI computing platforms offers a promising alternative to dedicated server GPUs, enabling continuous research and practical use in future aerial tracking endeavors.

Proposed method
In this section, we present the AMST2 tracker for aerial tracking, which utilizes an aggregated multi-level spatial and temporal context-based transformer. The proposed tracker consists of four submodules: (1) the Siamese feature extraction network, (2) the template update network, (3) the transformer module (which includes the multi-level spatial encoder, temporal encoder, aggregation encoder, and multi-context decoder), and (4) the classification and regression network. To provide a clear comparison with existing tracking algorithms, we introduce baseline algorithms that utilize the multi-level spatial encoder, temporal encoder, and template update network. We then propose an extension to these baseline algorithms by adopting an aggregation encoder that combines the representations learned by the multi-level spatial and temporal encoders, along with a modified decoder for tracking. A visual representation of our method can be seen in Fig. 2, and we provide further details on the approach below.
Feature extraction network. As feature extraction backbones, deep CNNs such as GoogLeNet [64], MobileNet [65], and ResNet [38] have been widely used in various trackers. However, their heavy computation requirements limit their use on embedded platforms such as UAVs.
To solve this problem, a light-weight feature extractor, such as AlexNet with additional convolution layers, is converted to use online temporally adaptive convolution (TAdaConv) [66], inspired by [55]. TAdaConv considers the temporal context at the search feature level. A typical convolutional layer shares its learnable weights and bias over the entire tracking sequence. In contrast, the parameters of the online convolution layer are calculated from the learnable weights and bias together with calibration factors that vary for each frame. As a result, features that contain temporal information can be extracted at the feature level using convolutional weights that are dynamically calibrated by the previous frames. Since TAdaConv is calibrated using global descriptors of the features in the previous frames, the tracking performance with the temporally adaptive convolutional network (TAdaCNN) improves remarkably with only a small drop in frame rate. For more details on how to transform a standard convolution layer into TAdaConv, please refer to [55, 66]. Utilizing the features of both low-level and high-level convolution layers improves tracking accuracy. Therefore, using TAdaCNN $\varphi$ as the backbone, multi-level spatial information is obtained by calculating the similarity maps from the hierarchical features of multiple TAdaCNN layers at the $t$-th frame.
where $Z$ and $X$ represent the template and search images, respectively, $\circledast$ denotes depth-wise cross-correlation, and $\varphi^i$ denotes the $i$-th layer of the TAdaCNN.
Figure 2. The overall tracking process of the proposed tracker. The AMST2 tracker is composed of four main components: a Siamese feature extractor, a template update network, a transformer, and a classification and regression network. The transformer module consists of multi-level spatial, temporal, and aggregation encoders, along with a multi-context decoder. The multi-level spatial encoder takes the similarity maps generated from the 3rd and 4th layer features as input, while the temporal encoder uses the similarity map generated from the 5th layer features and the output of the previous temporal encoder (indicated by the blue dotted line) as input. The aggregation encoder receives the outputs of the multi-level spatial and temporal encoders as inputs. The multi-context decoder uses the outputs of all encoders and the similarity map generated with the 5th layer features as inputs. Furthermore, the template update process incorporates an update patch, the previous template features, and the initial template features. This process is executed either at fixed frame intervals or under certain conditions to update the template.
Transformer encoder. The similarity maps calculated using the hierarchical features of the multi-level layers of the backbone are pre-processed before being fed into the multi-level spatial and temporal encoders. The architecture of the proposed transformer encoder is shown in Fig. 3. First, the similarity maps $R^3_t$, $R^4_t$, and $R^5_t$ obtained from the $t$-th frame are passed through a convolutional layer. Afterwards, the refined similarity maps $T_t \in \mathbb{R}^{HW \times C}$, $S^3_t \in \mathbb{R}^{HW \times C}$, $S^4_t \in \mathbb{R}^{HW \times C}$, and $S^5_t \in \mathbb{R}^{HW \times C}$ are obtained using a reshape operation ($T_t$ is obtained by copying $S^5_t$, such that $T_t = S^5_t$). The attention mechanism is a crucial component of a standard transformer. It uses the query, key, and value, represented as $Q$, $K$, and $V$, respectively. The attention function in a standard transformer is typically defined as scaled dot-product attention, which can be expressed as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \tag{2}$$
where $1/\sqrt{d_k}$ is a scaling factor that controls the softmax distribution and avoids the vanishing gradient problem. By extending the attention module to multiple heads, the model can extract representations in multiple subspaces as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(H_1, \ldots, H_N)\,W^O, \quad H_j = \mathrm{Attention}(QW^Q_j, KW^K_j, VW^V_j), \tag{3}$$
where $W^Q_j$, $W^K_j$, $W^V_j$, and $W^O$ are learnable weight matrices, Concat(·) represents concatenation, and $N$ is the number of attention heads.
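As an illustration, the scaled dot-product and multi-head attention described above can be sketched in plain Python. This is a minimal, unoptimized sketch rather than the authors' implementation; the helper names (`matmul`, `softmax`, `attention`, `multi_head`), the list-of-lists matrix layout, and the toy projection matrices are assumptions made for this example.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head(q, k, v, w_q, w_k, w_v, w_o):
    """Run one attention per head, concatenate the heads, then project with w_o.

    w_q, w_k and w_v are lists holding one projection matrix per head.
    """
    heads = [
        attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
        for wq, wk, wv in zip(w_q, w_k, w_v)
    ]
    # Concatenate the per-head outputs along the channel dimension.
    concat = [[x for head in heads for x in head[i]] for i in range(len(q))]
    return matmul(concat, w_o)


# Toy usage (hypothetical values): 2 tokens, C = 2 channels, N = 2 heads of width C / N = 1.
tokens = [[1.0, 0.0], [0.0, 1.0]]
pick = [[[1.0], [0.0]], [[0.0], [1.0]]]  # head j simply selects channel j
identity = [[1.0, 0.0], [0.0, 1.0]]
out = multi_head(tokens, tokens, tokens, pick, pick, pick, identity)
```

In a real tracker the rows would be the $HW$ positions of a flattened similarity map and the projections would be learned; a tensor library would replace the list arithmetic.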
Multi-level spatial encoder. Cao et al. utilized a combination of multi-level spatial information to fully explore the inter-dependencies between hierarchical features [54]. Specifically, with learnable position encoding, $S^3_t$ and $S^4_t$ are combined using addition and normalization to obtain $M^1_t$, i.e., $M^1_t = \mathrm{Norm}(S^3_t + S^4_t)$, which is then fed into a multi-head attention layer to obtain $M^2_t$ using the equation in (3).
As shown in (4), by considering the global context of $S^3_t$ and $S^4_t$ and learning the inter-dependencies of the two feature maps, $M^2_t$ is enhanced to a high-resolution feature map. Thereafter, $M^3_t$ can be obtained by an add operation and a normalization layer, i.e., $M^3_t = \mathrm{Norm}(M^2_t + S^3_t)$. To fully explore the inter-dependencies between $M^3_t$ and $S^4_t$, we adopt a modulation layer. The modulation layer efficiently exploits the internal spatial information between $M^3_t$ and $S^4_t$; its output $M^4_t$ can be expressed as:
where FFN(·) denotes a feed-forward network (FFN), GAP(·) denotes global average pooling (GAP), and $\gamma$ and $\mathcal{F}(\cdot)$ represent a learnable weight and a convolution layer, respectively. The final output $M^m_t \in \mathbb{R}^{HW \times C}$ of the multi-level spatial encoder can be expressed as:
The compressed embedding features of the multi-level spatial encoder not only effectively discriminate objects under scale variation, but are also robust for small objects. The multi-level spatial encoder is shown in Fig. 3a.
Temporal encoder. Aside from using temporal information at the feature level, Cao et al. refined the similarity map with temporal prior knowledge by integrating both the previous knowledge and the current information at the similarity level [55]. The temporal context-based encoder is composed of three multi-head attention layers and one temporal information filter. The temporal encoder is shown in Fig. 3b. Given the previous prior knowledge $T^m_{t-1}$ and the current similarity map $T_t$ as inputs of the encoder, $T^1_t$ can be obtained using the first multi-head attention layer.
Then, $T^2_t$ [...] included, which degrades tracker performance when the temporal information of the entire frame is exploited.
To solve this problem, the temporal information filter is obtained by feeding the global descriptor of $T^2_t$, which is the result of GAP, into the FFN. The temporal information filter and the filtered information $T^f_t$ can be expressed as:
where $f$ is the temporal information filter. The temporal knowledge of the $t$-th frame, $T^m_t \in \mathbb{R}^{HW \times C}$, as the final output of the temporal encoder can be expressed as:
where Norm(·) denotes a normalization layer. Notably, for the first frame there is no previous frame from which prior knowledge can be taken. Therefore, the initial similarity map is set by a convolution operation using an initial convolution layer.
Aggregation encoder. To improve tracking performance by integrating multi-level spatial and temporal information, we propose an aggregation encoder that aggregates the outputs of the multi-level spatial and temporal encoders. The aggregation encoder modifies the multi-head attention layer of the standard encoder, allowing the output of the multi-level spatial encoder to be injected into the output of the temporal encoder. Given the outputs $M^m_t$ and $T^m_t$ of each encoder, the attention weight for the aggregation encoder can be expressed as:
where the per-head projection matrices in $\mathbb{R}^{C \times C/N}$ are the learnable weights of the linear layers and $j$ is the index of the head. According to (11), the output of the $j$-th head and the output $H$ of the modified multi-head attention layer can be expressed as:
where $W^O \in \mathbb{R}^{C \times C}$ is a learnable weight matrix and $N$ is the number of attention heads. Afterwards, $A^1_t$ can be obtained using an add operation and a normalization layer, i.e., $A^1_t = \mathrm{Norm}(T^m_t + H)$. Finally, the output $A^m_t$ of the aggregation encoder can be obtained by:
The output of the aggregation encoder integrates multi-level spatial and temporal information to generate more powerful features in complex scenarios. The detailed structure of the aggregation encoder is shown in Fig. 3c.
Transformer decoder. We propose a multi-context decoder to utilize both high-resolution and low-resolution information, and to further exploit the interrelation between current spatial features and temporal knowledge. The proposed multi-context decoder introduces a structure that integrates the refined multi-context features using the outputs of the multi-level spatial and temporal encoders. Therefore, unlike the decoder of the standard transformer, we adopt three multi-head attention layers. After the first multi-head attention layer, the output of the aggregation encoder is used as the key, and the outputs of the multi-level spatial and temporal encoders are used as the values of the two subsequent attention layers, respectively. As a result, the proposed method not only preserves the feature information of the multi-level spatial and temporal encoders individually, but also strengthens attention at the locations that the aggregation encoder identifies as carrying valid aggregated multi-context information. The positional encoding of the multi-level spatial encoder is used to distinguish each location on the feature map. However, to avoid a direct influence on the multi-context-based transformed features, the decoder is designed without positional encoding and implicitly receives the positional information of the multi-level spatial encoder [54]. The multi-context decoder is shown in Fig. 4.
The current low-resolution similarity map $S^5_t$ [...]
where $D^2_t$ is the result of setting the key and value to $A^m_t$ and $M^m_t$, respectively, and $D^3_t$ is the result of setting the key and value to $A^m_t$ and $T^m_t$, respectively. The final result $D^*_t$ of the transformer, which contains the multi-context information, can be obtained by using $D^2_t$ and $D^3_t$ obtained from (15).
Template update. Although temporal context information is used through TAdaCNN, updating temporal information only at the feature level of the search region can cause frequent tracking failures due to inconsistency between the search and template features over time. In addition, when a template is updated using the backbone network, the information of the initial template, which is an uncontaminated sample, can be lost; this violates the principle of visual tracking, which is to track an arbitrary object using an initial template. We adopt the template update network as a feature fusion network [44] that combines the features of the initial template and the update sample, as can be seen in Fig. 2. Given the template and the update sample in the $k$-th frame, the updated template $\hat{Z}_k$ produced by the template update network is calculated as:
where $Z_1$ and $U_k$ denote the initial template and the update image of the $k$-th frame, respectively. $Z^i_k$ and $\varphi^i_1(Z_1)$ respectively represent the previous updated template and the initial template feature of the first frame. $\psi^i_k(\cdot)$ represents the template update network. $Z^i_k$ is initialized to $\varphi^i_1(Z_1)$ in the first updating process. The template update network consists of three 1 × 1 convolutional layers with channels of $C$, $C/2$, and $C$. Each of the first two convolutional layers is followed by a ReLU. We update the template every $\delta$ frames or when the confidence score is lower than the threshold $\tau$. The template update network can learn powerful representations of object appearance changes and can prevent tracking failure due to extreme drift over time.
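The update rule stated above (update every $\delta$ frames, or when the confidence score falls below $\tau$) can be sketched as a small predicate. This is an illustrative sketch only: the function name, the zero-based frame indexing, and the choice to count the interval from the first frame are assumptions; the settings $\delta = 50$ and $\tau = 0.8$ are the values reported for short-term datasets in the testing setup, and the confidence trace is made up for the example.

```python
def should_update_template(frame_idx, confidence, delta, tau):
    """Return True when the template should be refreshed at this frame.

    frame_idx  : zero-based index of the current frame (assumed convention)
    confidence : tracker confidence score for the current prediction
    delta      : update interval in frames
    tau        : confidence threshold
    """
    periodic = frame_idx > 0 and frame_idx % delta == 0
    low_confidence = confidence < tau
    return periodic or low_confidence


# Hypothetical confidence trace around frame 50.
scores = {49: 0.95, 50: 0.93, 51: 0.91, 52: 0.62}
updates = [f for f, s in scores.items() if should_update_template(f, s, delta=50, tau=0.8)]
```

Here frame 50 triggers the periodic update and frame 52 triggers the low-confidence update.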
Network training loss. The proposed loss function consists of two branches for classification and regression tasks, similar to the HiFT tracker [54]. The first classification branch computes the foreground and background scores of a given location, while the second branch measures the distance contrast between the location and the center of the ground-truth to remove low-quality boxes. For regression, a linear combination of the L1-norm and the complete-IoU (CIoU) [67] is used. The regression loss can be formulated as:
where $b_j$ is the $j$-th predicted bounding box and $b^{gt}$ is its corresponding ground-truth box; $c_j$ and $c^{gt}$ respectively represent the centers of the predicted and ground-truth boxes; $\rho(\cdot)$ represents the Euclidean distance; $d$ is the diagonal length of the box covering the predicted bounding box and the ground-truth box; $\upsilon$ represents the correspondence between the aspect ratios of the predicted bounding box and the ground-truth box; $\alpha$ is a positive trade-off parameter that controls the balance between non-overlapping and overlapping cases; and $\lambda_I = 1$, $\lambda_C = 0.5$, and $\lambda_{L1} = 0.5$ are the regularization parameters in our experiments. The total loss function can be expressed as:
where $\lambda_1 = 1$, $\lambda_2 = 1$, and $\lambda_3 = 1.2$ are the regularization parameters in our experiments. The feature extractor of the proposed model includes a Siamese network and a template update network to control features online. However, training the network with only the total loss can lead to over-fitting and to a dilemma in balancing the roles of the Siamese network and the template update network. To address this issue, we adopt a multi-aspect loss training method [44]. The multi-aspect training loss includes three aspects. Firstly, the $L_{template}$ loss is based on the template sample and the search region, which allows the network to track like an existing Siamese tracker using the template.
Secondly, the $L_{update}$ loss is obtained using the update sample and the search region; the update sample can also be regarded as a template sample, resulting in a complementary data augmentation effect. Thirdly, the $L_{overall}$ loss is obtained by using the updated template, which is the output of the template update network, and the search area, so that the network learns to track the location of an object using the updated template information. Finally, the $L_{final}$ loss is expressed as:
where each of $L_{template}$, $L_{update}$, and $L_{overall}$ takes the form of $L_{total}$ in (19) and is computed using the template sample, the update sample, and the updated template feature, respectively.
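To make the regression terms concrete, the following sketch computes the standard complete-IoU loss, $1 - \mathrm{IoU} + \rho^2/d^2 + \alpha\upsilon$, for a pair of boxes and combines it with IoU and L1 terms using the weights 1, 0.5, and 0.5 quoted above. It is a hedged illustration, not the authors' code: the corner box format `(x1, y1, x2, y2)`, the function names, the unnormalized mean L1 term, and the exact way the three terms are combined are assumptions.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def ciou_loss(pred, gt):
    """Complete-IoU loss: 1 - IoU + rho^2 / d^2 + alpha * v."""
    overlap = iou(pred, gt)
    # Squared Euclidean distance between the two box centers.
    rho2 = ((pred[0] + pred[2] - gt[0] - gt[2]) / 2) ** 2 \
        + ((pred[1] + pred[3] - gt[1] - gt[3]) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    d2 = (max(pred[2], gt[2]) - min(pred[0], gt[0])) ** 2 \
        + (max(pred[3], gt[3]) - min(pred[1], gt[1])) ** 2
    # Aspect-ratio consistency term and its trade-off weight.
    ratio_gt = math.atan((gt[2] - gt[0]) / (gt[3] - gt[1]))
    ratio_pred = math.atan((pred[2] - pred[0]) / (pred[3] - pred[1]))
    v = (4 / math.pi ** 2) * (ratio_gt - ratio_pred) ** 2
    alpha = v / ((1 - overlap) + v) if v > 0 else 0.0
    return 1 - overlap + rho2 / d2 + alpha * v


def regression_loss(pred, gt, w_iou=1.0, w_ciou=0.5, w_l1=0.5):
    """Weighted sum of IoU, CIoU and L1 terms (this combination is an assumption)."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt)) / 4
    return w_iou * (1 - iou(pred, gt)) + w_ciou * ciou_loss(pred, gt) + w_l1 * l1


example = ciou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

For the two example boxes the aspect ratios match, so the aspect term vanishes and the loss reduces to $1 - 1/7 + 2/18$.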

Experimental results
In this section, we conducted comprehensive experiments with the proposed tracker AMST2 on various UAV datasets including DTB70 [68].
Implementation details. Training. In the training phase, AMST2 was trained on the ImageNet VID [19], COCO [79], GOT-10K [80], and LaSOT [81] datasets. We exploited three samples for training. We used the same patch size of 127 × 127 for both the template and the update sample, and a search patch of size 287 × 287. Our backbone is an AlexNet with the last three layers converted by TAdaConv and initialized with pre-trained weights from ImageNet. For efficient learning of the temporal context of TAdaConv, we used one search patch for one half of the total epochs, two search patches for one third of the total epochs, and three search patches for the remaining epochs. The transformer architecture consists of one multi-level spatial encoder layer, one temporal encoder layer, one aggregation encoder layer, and two multi-context decoder layers. The whole network is trained with stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0001. The batch size was 180 and the network was trained for 100 epochs. For the first 20 epochs, the layers of the backbone are frozen; in the remaining epochs, the last three layers are fine-tuned. We used a warm-up learning rate from 0.005 to 0.01 in the first 10 epochs and a learning rate decreasing from 0.01 to 0.00005 in log space in the remaining epochs. The training process was conducted with two NVIDIA RTX 3090 GPUs.
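The learning-rate schedule described above can be sketched as a function of the epoch index. This is a minimal sketch under stated assumptions: the warm-up is taken to be linear, the decay is taken to be a linear interpolation between the logarithms of the two rates, epochs are zero-based, and the function names are made up for the example.

```python
import math


def learning_rate(epoch, total_epochs=100, warmup_epochs=10,
                  lr_start=0.005, lr_peak=0.01, lr_end=0.00005):
    """Learning rate for a zero-based epoch index."""
    if epoch < warmup_epochs:
        # Linear warm-up from lr_start to lr_peak (the linear shape is an assumption).
        frac = epoch / (warmup_epochs - 1)
        return lr_start + frac * (lr_peak - lr_start)
    # Interpolate linearly between log(lr_peak) and log(lr_end).
    frac = (epoch - warmup_epochs) / (total_epochs - warmup_epochs - 1)
    return math.exp((1 - frac) * math.log(lr_peak) + frac * math.log(lr_end))


def backbone_is_trainable(epoch, frozen_epochs=20):
    """The backbone stays frozen for the first 20 epochs, then its last layers are fine-tuned."""
    return epoch >= frozen_epochs


schedule = [learning_rate(e) for e in range(100)]
```

With these assumptions the schedule starts at 0.005, reaches 0.01 at the end of the warm-up, and decays geometrically to 0.00005 at the last epoch.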
Testing. In the inference phase, to obtain the initial temporal prior knowledge, we calculated the correlation between the template and search patches using only the initial frame. Afterwards, the object is tracked smoothly by continuously matching the features of the search area, cropped based on the object position in the previous frame, with either the template features obtained in the initial frame or the updated template features from the template update network. The threshold $\tau$ of the template update process was set to 0.8. In addition, $\delta$ was set to 50 for short-term aerial tracking datasets such as DTB70 and to 150 for long-term aerial datasets such as UAV123. To smooth the motion of the object, the cosine window and the scale change penalty are applied to the predicted boxes to eliminate boundary outliers and minimize large changes in size and ratio [5, 37]. After that, the prediction box with the best score is selected, and the size of the bounding box is updated by linear interpolation. Fig. 2 shows the whole tracking process; our tracker operates on a single NVIDIA RTX 3090 GPU for real-time tracking.
Evaluation metrics. We employed One Pass Evaluation (OPE) [69, 82] to evaluate the proposed method. OPE is based on two metrics: (1) precision and (2) success rate. The precision uses the center location error (CLE) between the predicted bounding box and the ground-truth box:
$$\mathrm{CLE}_t = \lVert c_t - c^{gt}_t \rVert,$$
where $c_t$ and $c^{gt}_t$ respectively represent the centers of the $t$-th predicted and ground-truth bounding boxes, and $\lVert \cdot \rVert$ is the Euclidean distance. The precision plot displays the percentage of frames in which the center location error is below a specific threshold. A threshold of 20 pixels is used to evaluate and rank the trackers.
The success rate measures the overlap, computed as the IoU between the predicted and ground-truth bounding boxes. The overlap ratio $\mathrm{OR}_t$ in the $t$-th frame is expressed as:
$$\mathrm{OR}_t = \frac{|b_t \cap b^{gt}_t|}{|b_t \cup b^{gt}_t|},$$
where $b_t$ and $b^{gt}_t$ denote the predicted and ground-truth bounding boxes in the $t$-th frame, $\cap$ and $\cup$ respectively represent the intersection and union of the regions of the two boxes, and $|\cdot|$ is the number of pixels in the region. The success plot shows the percentage of successful frames whose overlap ratio exceeds a pre-defined threshold varied from 0 to 1. The area under curve (AUC) score of the success plot is mainly adopted to rank the trackers.
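The two OPE metrics can be sketched as follows. This is an illustrative implementation rather than the benchmark toolkit: the `(x, y, w, h)` box format, the function names, the inclusive comparison at the 20-pixel threshold, and the use of 21 evenly spaced overlap thresholds for the AUC are assumptions made for this example.

```python
import math


def center_error(pred, gt):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    px, py = pred[0] + pred[2] / 2, pred[1] + pred[3] / 2
    gx, gy = gt[0] + gt[2] / 2, gt[1] + gt[3] / 2
    return math.hypot(px - gx, py - gy)


def overlap_ratio(pred, gt):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = max(0.0, min(pred[0] + pred[2], gt[0] + gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[1] + pred[3], gt[1] + gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0


def precision(preds, gts, threshold=20.0):
    """Fraction of frames whose center location error is within the threshold."""
    hits = sum(center_error(p, g) <= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


def success_auc(preds, gts, steps=21):
    """Mean success rate over overlap thresholds evenly spaced in [0, 1]."""
    overlaps = [overlap_ratio(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


# Hypothetical two-frame sequence: one perfect prediction and one complete miss.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (100, 100, 10, 10)]
p20 = precision(preds, gts)
auc = success_auc(preds, gts)
```

In this toy sequence half of the frames fall within 20 pixels, and the perfect frame counts as a success at every threshold below 1, which gives an AUC of 10/21.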
Quantitative evaluation with the light-weight trackers. Evaluation on DTB70. DTB70 [68] contains 70 challenging sequences constructed from data collected by UAVs. In addition, various challenging scenes with translation, rotation, and different sizes and aspect ratios due to camera motion further complicate the dataset. The robustness of our tracker in various complex scenarios caused by the fast motion of the UAV can be demonstrated with this benchmark. In the comparison with other trackers, AMST2 ranked first in both precision (0.851) and success rate (0.658); the results are shown in Fig. 5. Compared to the second-best TCTrack (0.815) and third-best HiFT (0.804), the precision improved by about 4.4% and 5.8%, respectively. Similarly, in success rate, AMST2 shows performance increases of 6.0% and 10.8% over TCTrack (0.621) and HiFT (0.594), respectively.
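The percentages quoted for DTB70 are consistent with relative improvements, i.e., the score difference divided by the competitor's score. The short check below reproduces them from the reported scores; the helper name `relative_gain` is introduced here only for illustration.

```python
def relative_gain(ours, other):
    """Relative improvement of `ours` over `other`, in percent."""
    return 100.0 * (ours - other) / other


# Reported DTB70 scores: AMST2 vs. TCTrack and HiFT.
precision_gains = [round(relative_gain(0.851, ref), 1) for ref in (0.815, 0.804)]
success_gains = [round(relative_gain(0.658, ref), 1) for ref in (0.621, 0.594)]
# precision_gains -> [4.4, 5.8], success_gains -> [6.0, 10.8]
```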
Evaluation on UAV123. UAV123 [69] is a large-scale aerial tracking benchmark collected from an aerial viewpoint, consisting of a total of 123 video sequences containing over 112 K frames. The objects in the dataset are difficult to track due to large scale changes, illumination changes, and occlusion, especially for small objects. As shown in Fig. 5, AMST2 outperforms all other trackers in both precision and success rate. In terms of precision, the proposed method, with a score of 0.832, surpasses the second-best TCTrack (0.800) and third-best HiFT (0.787) by 4.0% and 5.7%, respectively. The success rate also increased by about 4.3% and 7.0%, respectively, compared to the baseline trackers.
Figure 5. Comparison of overall performance with the light-weight trackers. The evaluation used the precision and success plots of the proposed tracker and 29 other light-weight trackers.
Evaluation on UAV123@10fps. UAV123@10fps [69] is a version of the original UAV123 downsampled to a frame rate of 10 FPS. The tracking problem is more challenging than in the original version because the displacement and variation of the object between frames are larger. As shown in Fig. 5, our tracker achieves the best performance in terms of both precision (0.798) and success rate (0.616). This shows that our tracker remains robust on discontinuous aerial data captured at a lower frame rate.
Evaluation on UAV20L. UAV20L [69] was used for long-term tracking performance evaluation. This benchmark is a subset of UAV123 and consists of 20 long-term tracking sequences with an average of 2934 frames. As shown in Table 1, AMST2 attains first place with a precision of 0.784, ahead of the second-best TCTrack (0.780) and third-best HiFT (0.763) by small margins of about 0.5% and 2.8%, respectively. The success rate of AMST2 is also the best (0.601), showing better tracking performance than TCTrack (0.580) and HiFT (0.566). This indicates that the proposed method generates better features for tracking than existing methods on long-term datasets.
UAVTrack112_L [70] is a well-known long-term tracking dataset designed for aerial tracking. It comprises over 60,000 frames and is a subset of UAVTrack112 [70]. As demonstrated in Table 2, AMST2 is more resilient than state-of-the-art trackers. AMST2 secures the top spot with a precision score of 0.835, surpassing TCTrack (0.786) and SiamRPN++ (0.769) by approximately 6.2% and 8.6%, respectively. In terms of success rate (0.629), AMST2 also demonstrates superior performance to other trackers. These results confirm the superiority of our tracker over existing light-weight trackers on long-term benchmarks.
Attribute comparison. Due to the severe motion of UAVs, aerial tracking faces various challenges. Attributes are annotated in the benchmark datasets, as shown in Figs. 6 and 7, to evaluate tracker performance under various challenging conditions. Figure 6 illustrates that the proposed tracker outperforms other light-weight trackers in several challenging scenarios on the DTB70 and UAV123 benchmarks. Figure 7 depicts the evaluation results for all attributes on the UAV123@10fps benchmark. In terms of precision, our tracker secures the second-best position under the low-resolution and similar-object conditions, and first place in all other attributes. In particular, AMST2 achieves the highest success rate for all attributes in the UAV123@10fps dataset. By utilizing multi-level spatial and temporal information, our tracker exhibits exceptional performance in various scenarios, such as scale variation, deformation, fast camera motion, and occlusion, among others. Moreover, template updates at the template feature level provide more robust tracking under extreme variations.
Table 1. Overall performance on UAV20L. The best three performances are respectively highlighted with bold italic, italic, and bold.

Ablation study.
To validate the impact of the proposed method, we performed several ablation studies on the DTB70 dataset. We evaluated five variants of our tracker: (1) MS, which uses only the features of the multi-level spatial encoder, as the first baseline; (2) TE, which uses only a temporal encoder, as the second baseline; (3) MS+TE, which applies both the multi-level spatial and temporal encoders; (4) MS+TE+TU, in which a template update network is added to MS+TE; and (5) MS+TE+AE+TU, the final model, in which the aggregation encoder is added to MS+TE+TU. In this ablation study, the same multi-context decoder structure was used for all variants that apply both multi-level spatial and temporal information. As shown in Table 3, the full model not only performs strongly under various complex conditions but also achieves the highest precision and success rate.
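The five ablation variants differ only in which modules are enabled, which can be summarized as a small configuration table. The sketch below is illustrative: the class and field names are hypothetical and do not come from the authors' code, and only the variant labels (MS, TE, AE, TU) are taken from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AblationVariant:
    """Module switches for one ablation variant (field names are hypothetical)."""
    multi_level_spatial: bool  # MS: multi-level spatial encoder
    temporal: bool             # TE: temporal encoder
    aggregation: bool          # AE: aggregation encoder
    template_update: bool      # TU: template update network

    @property
    def label(self):
        """Build the variant label used in the text, e.g. 'MS+TE+TU'."""
        parts = [
            ("MS", self.multi_level_spatial),
            ("TE", self.temporal),
            ("AE", self.aggregation),
            ("TU", self.template_update),
        ]
        return "+".join(name for name, enabled in parts if enabled)

VARIANTS = [
    AblationVariant(True, False, False, False),  # baseline 1
    AblationVariant(False, True, False, False),  # baseline 2
    AblationVariant(True, True, False, False),
    AblationVariant(True, True, False, True),
    AblationVariant(True, True, True, True),     # final model
]

for variant in VARIANTS:
    print(variant.label)
```

Each successive variant after the two baselines adds exactly one module, so differences between adjacent rows can be attributed to that module.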
Quantitative evaluation with the deep trackers. Our goal was to enhance the robustness of aerial tracking by combining multi-level spatial and temporal information, and thus to handle complex conditions. To obtain clearer results, we compared our method with 22 state-of-the-art trackers that use deeper backbones. As depicted in Fig. 8, even though our method uses a light-weight backbone, it achieves competitive performance at a significantly faster tracking speed than AiATrack, which has the highest success rate. Furthermore, to support the attribute-based analysis with deep trackers, we conducted comparison experiments on all scenarios of DTB70 using the 10 fastest trackers. As shown in Fig. 9, our tracker outperforms the others in various complex and cluttered scenarios. The proposed robust feature representation, which aggregates multi-level spatial and temporal context, reduces the performance gap with deeper backbone-based trackers and ensures efficient and robust tracking in various aerial scenes. Table 4 presents an in-depth comparison between the proposed method and deeper backbone-based trackers, as well as baseline trackers. We evaluated multiple factors, including frames per second (fps), number of parameters, and performance metrics, on the well-known aerial datasets VisDrone-SOT2020 71 and UAVDT 72 . VisDrone-SOT2020 is based on data collected from numerous real-world situations with weather and lighting variations, and UAVDT also includes frames from complex scenarios that degrade tracker performance, involving weather, altitude, camera view, object appearance, and occlusion. For clarity, STARK and TransT use a modified version of ResNet that removes the last stage, so they have fewer parameters than trackers using the other deeper backbones.
HiFT, TCTrack, and the proposed tracker use far fewer parameters than the deep trackers and run at more than 100 fps. HiFT and TCTrack have advantages over the proposed tracker in parameters and fps, but they underperform both the deep trackers and the proposed tracker. Furthermore, our proposed tracker not only has lower parameter complexity than TransT, which achieved the highest score on VisDrone-SOT2020, but also exhibits similar precision and comparable success performance to deeper backbone models while running at twice the fps. These results highlight the efficiency and effectiveness of our proposed tracker in terms of parameter usage and overall tracking performance, showcasing its potential for real-time aerial tracking applications. On the UAVDT dataset, the proposed method performs comparably to state-of-the-art trackers while maintaining low parameter complexity and fast processing speed. These findings further demonstrate the effectiveness and efficiency of our proposed method in aerial tracking tasks. Some of the deeper backbone-based trackers run at close to 100 fps, but the proposed tracker outperforms them in terms of parameters and performance. Therefore, with its low latency, fast tracking speed, and strong performance, our tracker is more efficient for UAV-based aerial tracking than many SOTA trackers.
Table 3. Ablation analysis on DTB70 dataset. The red and blue arrows denote improvement compared to baseline 1 and baseline 2, respectively, and the down and up arrows indicate scores lower and higher than the baseline, respectively.

Conclusion
In this paper, we presented the aggregated multi-level spatial and temporal context-based transformer (AMST²) architecture, a novel approach for robust aerial tracking that leverages multi-level spatial and temporal information through a Transformer-based model. The proposed approach includes an aggregation encoder that enhances the similarity map and a multi-context decoder that generates powerful refined similarity maps. The utilization of an aggregated multi-level spatial and temporal information-based transformer, along with a light-weight backbone, effectively addresses the challenges of tracking speed and aerial tracking when employing UAVs. The adoption of a template update process further enhances the robustness of our approach against complex scenarios. Extensive experiments on challenging aerial benchmarks, including DTB70, UAV123, UAV123@10fps, UAV20L, and UAVTrack112_L, demonstrated that AMST² outperforms state-of-the-art methods in terms of both accuracy and efficiency.
While our approach shows promising results, there are still limitations to be addressed, such as the sensitivity to low-lighting conditions and the need for a large amount of training data. Future research can investigate ways to overcome these limitations and further improve the accuracy and efficiency of aerial tracking. Overall, the proposed approach represents a significant advancement in the development of more robust and effective aerial tracking systems.

Data availability
All data generated or analyzed in this study are included in this published article. The training and testing datasets used in this study are publicly available and have been cited in accordance with research rules. Detailed descriptions of the datasets and their citations can be found in the "Experimental results" section of the paper. For instance, the ImageNet VID dataset's training set can be downloaded from the link https://image-net.org/challenges/LSVRC/2015/index.php. The COCO dataset's training set can be downloaded from https://cocodataset.org/#home, while the GOT-10K dataset's training set can be downloaded from http://got-10k.aitestunion.com/. Furthermore, the LaSOT dataset's training set can be accessed via http://vision.cs.stonybrook.edu/~lasot/. The testing sets of the DTB70 dataset, the UAV123, UAV123@10fps and UAV20L datasets, and the UAVTrack112_L dataset, VisDrone-SOT2020 dataset and UAVDT dataset can be downloaded from https://github.com/flyers/drone-tracking, https://cemse.kaust.edu.sa/ivul/uav123, https://github.com/vision4robotics/SiamAPN, http://aiskyeye.com/, and https://sites.google.com/view/grli-uavdt, respectively.